This paper investigates hardware support for fine-grain distributed shared memory (DSM) 
in networks of workstations. To reduce design time and implementation cost relative to 
dedicated DSM systems, we decouple the functional hardware components of DSM 
support, allowing greater use of off-the-shelf devices. We present two decoupled systems, 
Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and 
network interface; a custom access control device is the only DSM-specific hard ... 
This paper proposes a set of efficient primitives for process synchronization in 
multiprocessors. The only assumptions made in developing the set of primitives are that 
hardware combining is not implemented in the interconnect, and (in one case) that the 
interconnect supports broadcast. The primitives make use of synchronization bits 
(syncbits) to provide a simple mechanism for mutual exclusion. The proposed 
implementation of the primitives includes efficient (i.e. 






multiprocessors 

Vijayaraghavan Soundararajan, Mark Heinrich, Ben Verghese, Kourosh Gharachorloo, Anoop 
Gupta, John Hennessy 

April 1998 ACM SIGARCH Computer Architecture News, Proceedings of the 25th annual international symposium on Computer architecture ISCA '98, volume 26 Issue 3

Publisher: IEEE Computer Society, ACM Press 

Full text available: pdf(1.76 MB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture 
of choice for shared-memory machines. The most important characteristic of the CC- 
NUMA architecture is that the latency to access data on a remote node is considerably 
larger than the latency to access local memory. On such machines, good data locality can 
reduce memory stall time and is therefore a critical factor in application performance. In 
this paper we study the various options available to system desi ... 
Ashley Saulsbury, Su-Jaen Huang, Fredrik Dahlgren 

May 1999 Proceedings of the 13th international conference on Supercomputing 
This paper develops and validates an analytical model for evaluating various types of 
architectural alternatives for shared-memory systems with processors that aggressively 
exploit instruction-level parallelism. Compared to simulation, the analytical model is many 
orders of magnitude faster to solve, yielding highly accurate system performance 
estimates in seconds. The model input parameters characterize the ability of an application 
to exploit instruction-level parallelism as well as the interac ... 

DASH is a scalable shared-memory multiprocessor currently being developed at Stanford's 
Computer Systems Laboratory. The architecture consists of powerful processing nodes, 
each with a portion of the shared-memory, connected to a scalable interconnection 
network. A key feature of DASH is its distributed directory-based cache coherence 
protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely 
on broadcast; instead it uses point-to-point messages sent between th ... 
Edward Gehringer, Janne Abullarade, Michael H. Gulyn 

September 1988 ACM SIGARCH Computer Architecture News, volume 16 issue 4 
This paper compares eight commercial parallel processors along several dimensions. The 
processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent 
Balance system, the Alliant FX series, and the ELXSI System 6400) and four network 
multiprocessors (the BBN Butterfly, the NCUBE, the Intel iPSC/2, and the FPS T Series). 
The paper contrasts the computers from the standpoint of interconnection structures, 
memory configurations, and interprocessor communication. Also, the share ... 

11 Evaluation of design alternatives for a multiprocessor microprocessor 
May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd annual international symposium on Computer architecture ISCA '96, volume 24 Issue 2

Publisher: ACM Press 

Full text available: pdf(1.37 MB) Additional Information: full citation, abstract, references, citings, index 
In the future, advanced integrated circuit processing and packaging technology will allow 
for several design options for multiprocessor microprocessors. In this paper we consider 
three architectures: shared-primary cache, shared-secondary cache, and shared-memory. 
We evaluate these three architectures using a complete system simulation environment 
which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and 
run a commercial operating system. Within our simulation envir ... 
Historically, processor accesses to memory-mapped device registers have been marked 
uncachable to ensure their visibility to the device. The ubiquity of snooping cache 
coherence, however, makes it possible for processors and devices to interact with 
cachable, coherent memory operations. Using coherence can improve performance by 
facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for 
polling). This paper begins an exploration of network interfaces (NIs) that u ... 
Satish Chandra, James R. Larus, Anne Rogers 

November 1994 ACM SIGPLAN Notices, ACM SIGOPS Operating Systems Review, 

Message passing and shared memory are two techniques parallel programs use for 
coordination and communication. This paper studies the strengths and weaknesses of 
these two mechanisms by comparing equivalent, well-written message-passing and 
shared-memory programs running on similar hardware. To ensure that our measurements 
are comparable, we produced two carefully tuned versions of each program and measured 
them on closely-related simulators of a message-passing and a shared-memory machine, 

This paper describes the synchronization and communication primitives of the Cray T3E 
multiprocessor, a shared memory system scalable to 2048 processors. We discuss what 
we have learned from the T3D project (the predecessor to the T3E) and the rationale 
behind changes made for the T3E. We include performance measurements for various 
aspects of communication and synchronization. The T3E augments the memory interface 
of the DEC 21164 microprocessor with a large set of explicitly-managed, external r ... 

16 Tempest and typhoon: user-level shared memory 
S. K. Reinhardt, J. R. Larus, D. A. Wood 

April 1994 ACM SIGARCH Computer Architecture News, Proceedings of the 21st annual international symposium on Computer architecture ISCA '94, volume 
Publisher: IEEE Computer Society Press, ACM Press 

Full text available: pdf(1.44 MB) Additional Information: full citation, abstract, references, citings, index 
Future parallel computers must efficiently execute not only hand-coded applications but 
also programs written in high-level, parallel programming languages. Today's machines 
limit these programs to a single communication paradigm, either message-passing or 
shared-memory, which results in uneven performance. This paper addresses this problem 
by defining an interface, Tempest, that exposes low-level communication and memory- 
Recent advances in Field-Programmable Gate Arrays (FPGA) and programmable 
interconnects have made it possible to build efficient hardware emulation engines. In 
addition, improvements in Computer-Aided Design (CAD) tools, mainly in synthesis tools, 
greatly simplify the design of large circuits. The RPM (Rapid Prototype Engine for 
Multiprocessors) Project leverages these two technological advances. Its goal is to 
develop a common hardware platform for th ... 

Keywords: field-programmable gate arrays, logic emulation, message-passing 
multicomputers, rapid prototyping, shared-memory multiprocessors 
Alewife is a multiprocessor architecture that supports up to 512 processing nodes 
connected over a scalable and cost-effective mesh network at a constant cost per node. 
The MIT Alewife machine, a prototype implementation of the architecture, demonstrates 
that a parallel system can be both scalable and programmable. Four mechanisms combine 
to achieve these goals: software-extended coherent shared memory provides a global, 
linear address space; integrated message passing allows compiler and operat ... 
This paper presents the architectural design and RISC-based implementation of a 
prototype supercomputer, namely the Orthogonal Multiprocessor (OMP). The OMP system 
is constructed with 16 Intel i860 RISC microprocessors and 256 parallel memory 
modules, which are 2-D interleaved and orthogonally accessed using custom-designed 
spanning buses. The architectural design has been validated by a CSIM-based 
multiprocessor simulator. The design choices are based on worst-case delay a ... 
The Caltech/JPL Mark III Hypercube originally consisted of an ensemble of processing 
elements each containing two Motorola M68020 processors — one M68020 processor and 
a M68881 Floating-Point coprocessor used for data processing, the other M68020 
processor dedicated to hypercube communications. In the interest of achieving even 
greater computational capability a third processing element, the Weitek XL system, was 
added to enhance floating point performance. Each of what is now three p ... 